Thera bank recently saw a steep decline in the number of users of their credit card. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data, identify the customers who are likely to leave the service, and understand the reasons why, so that it can improve in those areas.
As Data Scientists at Thera bank, we need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
We need to identify the best possible model that will give the required performance.
Data Dictionary:

- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer", else "Existing Customer"
- Customer_Age: Age in years
- Gender: Gender of the account holder
- Dependent_count: Number of dependents
- Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
- Marital_Status: Marital status of the account holder
- Income_Category: Annual income category of the account holder
- Card_Category: Type of card
- Months_on_book: Period of relationship with the bank
- Total_Relationship_Count: Total number of products held by the customer
- Months_Inactive_12_mon: Number of months inactive in the last 12 months
- Contacts_Count_12_mon: Number of contacts between the customer and the bank in the last 12 months
- Credit_Limit: Credit limit on the credit card
- Total_Revolving_Bal: The balance that carries over from one month to the next (revolving balance)
- Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (average of last 12 months)
- Total_Trans_Amt: Total transaction amount (last 12 months)
- Total_Trans_Ct: Total transaction count (last 12 months)
- Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter
- Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter
- Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

!pip install xgboost
!pip install imblearn
!pip install pandas-profiling
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
!pip install lightgbm
import lightgbm as lgb
from sklearn.dummy import DummyClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
# plot_confusion_matrix,
#plot_roc_curve,
)
# To be used for data scaling and encoding
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
OneHotEncoder,
RobustScaler,
)
from sklearn.impute import SimpleImputer
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# set the background for the graphs
plt.style.use("ggplot")
# For pandas profiling
# from pandas_profiling import ProfileReport
# Printing style
!pip install tabulate
from tabulate import tabulate
# To suppress warnings
import warnings
# date time
from datetime import datetime
warnings.filterwarnings("ignore")
# Loading the dataset
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/PGP: AIML University Of Texas/Assigenment - credit card customer churn prediction/BankChurners.csv'
churner = pd.read_csv(file_path)
# Checking the number of rows and columns in the data
churner.shape
(10127, 21)
additional_droppable_columns = [
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'
]
for col in additional_droppable_columns:
    if col in churner.columns:
        churner.drop(columns=[col], inplace=True)
# Creating a copy dataset for analysis
data = churner.copy()
# let's view the first 5 rows of the data
data.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# let's view the last 5 rows of the data
data.tail()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
# let's check the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
- There are a total of 21 columns and 10,127 observations in the dataset
- We can see that `Education_Level` and `Marital_Status` have fewer than 10,127 non-null values, i.e., these columns have missing values
# let's check for duplicate values in the data
data.duplicated().sum()
0
# let's check for missing values in the data
df_null_summary = pd.concat(
[data.isnull().sum(), data.isnull().sum() * 100 / data.isnull().count()], axis=1
)
df_null_summary.columns = ["Null Record Count", "Percentage of Null Records"]
df_null_summary[df_null_summary["Null Record Count"] > 0].sort_values(
by="Percentage of Null Records", ascending=False
).style.background_gradient(cmap="YlOrRd")
| | Null Record Count | Percentage of Null Records |
|---|---|---|
| Education_Level | 1519 | 14.999506 |
| Marital_Status | 749 | 7.396070 |
- `Education_Level` (~15%) and `Marital_Status` (~7.4%) have missing values, which we will treat later
Let's check the number of unique values in each column
data.select_dtypes(include="object").nunique()
| Feature | Unique values |
|---|---|
| Attrition_Flag | 2 |
| Gender | 2 |
| Education_Level | 6 |
| Marital_Status | 3 |
| Income_Category | 6 |
| Card_Category | 4 |
data.select_dtypes(exclude="object").nunique()
| Feature | Unique values |
|---|---|
| CLIENTNUM | 10127 |
| Customer_Age | 45 |
| Dependent_count | 6 |
| Months_on_book | 44 |
| Total_Relationship_Count | 6 |
| Months_Inactive_12_mon | 7 |
| Contacts_Count_12_mon | 7 |
| Credit_Limit | 6205 |
| Total_Revolving_Bal | 1974 |
| Avg_Open_To_Buy | 6813 |
| Total_Amt_Chng_Q4_Q1 | 1158 |
| Total_Trans_Amt | 5033 |
| Total_Trans_Ct | 126 |
| Total_Ct_Chng_Q4_Q1 | 830 |
| Avg_Utilization_Ratio | 964 |
- `Customer_Age` has only 45 unique values, i.e., most customers fall within a fairly narrow age range
# let's view the statistical summary of the numerical columns in the data
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
- The mean of the `Customer_Age` column is approximately 46 and the median is also 46, indicating a roughly symmetric age distribution
- `Dependent_count` has a mean and median of ~2
- `Months_on_book` has a mean and median of 36 months. The minimum value is 13 months, showing that the dataset captures customers who have been with the bank for at least one whole year
- `Total_Relationship_Count` has a mean and median of ~4
- `Credit_Limit` has a wide range of 1.4K to 34.5K; the median of 4.5K is well below the mean of 8.6K, indicating a right-skewed distribution
- `Total_Trans_Ct` has a mean of ~65 and a median of 67
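The mean-above-median gap seen for `Credit_Limit` (mean ~8.6K vs median ~4.5K) is the classic signature of right skew. A minimal sketch on synthetic lognormal data (illustrative numbers, not drawn from this dataset) shows the same pattern:

```python
# Right-skewed data pulls the mean above the median and gives positive
# sample skewness; lognormal draws mimic a Credit_Limit-like distribution.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=8.5, sigma=0.9, size=10_000))

print(s.mean() > s.median())  # mean exceeds median under right skew
print(s.skew() > 0)           # positive sample skewness
```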
# let's view the statistical summary of the categorical columns in the data
data.describe(include="object").T
| | count | unique | top | freq |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
# The function below prints unique value counts and percentages for the category/object type variables
def category_unique_value():
    for cat_col in (
        data.select_dtypes(exclude=[np.int64, np.float64]).columns.unique().to_list()
    ):
        print("Unique values and corresponding data counts for feature: " + cat_col)
        print("-" * 90)
        df_temp = pd.concat(
            [
                data[cat_col].value_counts(),
                data[cat_col].value_counts(normalize=True) * 100,
            ],
            axis=1,
        )
        df_temp.columns = ["Count", "Percentage"]
        print(df_temp)
        print("-" * 90)
category_unique_value()
Unique values and corresponding data counts for feature: Attrition_Flag
------------------------------------------------------------------------------------------
Count Percentage
Attrition_Flag
Existing Customer 8500 83.934
Attrited Customer 1627 16.066
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Gender
------------------------------------------------------------------------------------------
Count Percentage
Gender
F 5358 52.908
M 4769 47.092
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Education_Level
------------------------------------------------------------------------------------------
Count Percentage
Education_Level
Graduate 3128 36.338
High School 2013 23.385
Uneducated 1487 17.275
College 1013 11.768
Post-Graduate 516 5.994
Doctorate 451 5.239
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Marital_Status
------------------------------------------------------------------------------------------
Count Percentage
Marital_Status
Married 4687 49.979
Single 3943 42.045
Divorced 748 7.976
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Income_Category
------------------------------------------------------------------------------------------
Count Percentage
Income_Category
Less than $40K 3561 35.163
$40K - $60K 1790 17.676
$80K - $120K 1535 15.157
$60K - $80K 1402 13.844
abc 1112 10.981
$120K + 727 7.179
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Card_Category
------------------------------------------------------------------------------------------
Count Percentage
Card_Category
Blue 9436 93.177
Silver 555 5.480
Gold 116 1.145
Platinum 20 0.197
------------------------------------------------------------------------------------------
- The target variable `Attrition_Flag` has an Existing-to-Attrited ratio of 83.9 : 16.1, so there is class imbalance in the dataset
- ~93% of customers have the Blue card
- `Income_Category` has the junk value `abc` for ~11% of records, which we'll change to `Unknown`
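Because of the ~84:16 imbalance noted above, any train/test split should be stratified on the target so both partitions keep the same class ratio. A minimal sketch (placeholder features, only the label proportions mirror this dataset) illustrates the idea:

```python
# Stratified splitting preserves the minority-class proportion in both
# partitions, which matters when ~16% of customers are attrited.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(10127, 3))            # placeholder features
y = np.array([0] * 8500 + [1] * 1627)      # 1 = "Attrited Customer" (~16.1%)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
# Class ratio is preserved (to within rounding) in train and test
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```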
data.drop(columns=["CLIENTNUM"], inplace=True)
Education Level and Marital Status

Note: The missing value treatment should ideally be done after splitting the data into train, validation, and test sets. In this case, however, the treatment is generic: we are filling the gaps with the constant "Unknown", which depends on no statistic learned from the data, so it can safely be applied to the overall dataset. The same reasoning applies to remapping the junk value `abc` in the `Income_Category` column.
data["Education_Level"] = data["Education_Level"].fillna("Unknown")
data["Marital_Status"] = data["Marital_Status"].fillna("Unknown")
Income Category = abc

data.loc[data["Income_Category"] == "abc", "Income_Category"] = "Unknown"
category_unique_value()
Unique values and corresponding data counts for feature: Attrition_Flag
------------------------------------------------------------------------------------------
Count Percentage
Attrition_Flag
Existing Customer 8500 83.934
Attrited Customer 1627 16.066
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Gender
------------------------------------------------------------------------------------------
Count Percentage
Gender
F 5358 52.908
M 4769 47.092
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Education_Level
------------------------------------------------------------------------------------------
Count Percentage
Education_Level
Graduate 3128 30.888
High School 2013 19.878
Unknown 1519 15.000
Uneducated 1487 14.684
College 1013 10.003
Post-Graduate 516 5.095
Doctorate 451 4.453
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Marital_Status
------------------------------------------------------------------------------------------
Count Percentage
Marital_Status
Married 4687 46.282
Single 3943 38.936
Unknown 749 7.396
Divorced 748 7.386
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Income_Category
------------------------------------------------------------------------------------------
Count Percentage
Income_Category
Less than $40K 3561 35.163
$40K - $60K 1790 17.676
$80K - $120K 1535 15.157
$60K - $80K 1402 13.844
Unknown 1112 10.981
$120K + 727 7.179
------------------------------------------------------------------------------------------
Unique values and corresponding data counts for feature: Card_Category
------------------------------------------------------------------------------------------
Count Percentage
Card_Category
Blue 9436 93.177
Silver 555 5.480
Gold 116 1.145
Platinum 20 0.197
------------------------------------------------------------------------------------------
# let's check for missing values in the data
df_null_summary = pd.concat(
[data.isnull().sum(), data.isnull().sum() * 100 / data.isnull().count()], axis=1
)
df_null_summary.columns = ["Null Record Count", "Percentage of Null Records"]
df_null_summary[df_null_summary["Null Record Count"] > 0].sort_values(
by="Percentage of Null Records", ascending=False
).style.background_gradient(cmap="YlOrRd")
The filtered null summary is now empty: all the null values have been treated, along with the incorrect/junk data in the `Income_Category` column.
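The two treatments can be sanity-checked programmatically. A sketch on a toy frame (standing in for the cleaned `data`):

```python
# Verify both treatments: NaN -> "Unknown" and junk "abc" -> "Unknown".
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", np.nan, "High School"],
        "Income_Category": ["abc", "$120K +", "abc"],
    }
)
toy["Education_Level"] = toy["Education_Level"].fillna("Unknown")
toy.loc[toy["Income_Category"] == "abc", "Income_Category"] = "Unknown"

print(int(toy.isnull().sum().sum()))            # no nulls remain
print("abc" in toy["Income_Category"].values)   # junk value is gone
```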
Converting the data type of the categorical variables from object to category
category_columns = data.select_dtypes(include="object").columns.tolist()
data[category_columns] = data[category_columns].astype("category")
Replacing any spaces in column names with underscores and standardizing the names to lower case
data.columns = [i.replace(" ", "_").lower() for i in data.columns]
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   attrition_flag            10127 non-null  category
 1   customer_age              10127 non-null  int64
 2   gender                    10127 non-null  category
 3   dependent_count           10127 non-null  int64
 4   education_level           10127 non-null  category
 5   marital_status            10127 non-null  category
 6   income_category           10127 non-null  category
 7   card_category             10127 non-null  category
 8   months_on_book            10127 non-null  int64
 9   total_relationship_count  10127 non-null  int64
 10  months_inactive_12_mon    10127 non-null  int64
 11  contacts_count_12_mon     10127 non-null  int64
 12  credit_limit              10127 non-null  float64
 13  total_revolving_bal       10127 non-null  int64
 14  avg_open_to_buy           10127 non-null  float64
 15  total_amt_chng_q4_q1      10127 non-null  float64
 16  total_trans_amt           10127 non-null  int64
 17  total_trans_ct            10127 non-null  int64
 18  total_ct_chng_q4_q1       10127 non-null  float64
 19  avg_utilization_ratio     10127 non-null  float64
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB
We'll move on to data analysis now.
The first step of univariate analysis is to check the distribution/spread of the data, primarily using histograms and box plots. Additionally, we'll plot each numerical feature on a violin plot and a cumulative density distribution plot. To produce these four plots for each numerical attribute, we build the summary() function below, which also displays the feature-wise five-point summary.
def summary(data: pd.DataFrame, x: str):
"""
The function prints the 5 point summary and histogram, box plot,
violin plot, and cumulative density distribution plots for each
feature name passed as the argument.
Parameters:
----------
x: str, feature name
Usage:
------------
summary('age')
"""
x_min = data[x].min()
x_max = data[x].max()
Q1 = data[x].quantile(0.25)
Q2 = data[x].quantile(0.50)
Q3 = data[x].quantile(0.75)
stats = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}  # avoid shadowing the built-in dict
df = pd.DataFrame(data=stats, index=["Value"])
print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
print(tabulate(df, headers="keys", tablefmt="psql"))
fig = plt.figure(figsize=(16, 8))
plt.subplots_adjust(hspace=0.6)
sns.set_palette("Pastel1")
plt.subplot(221, frameon=True)
ax1 = sns.histplot(data[x], color="purple", kde=True, stat="density")  # distplot is deprecated in recent seaborn
ax1.axvline(
np.mean(data[x]), color="purple", linestyle="--"
) # Add mean to the histogram
ax1.axvline(
np.median(data[x]), color="black", linestyle="-"
) # Add median to the histogram
plt.title(f"{x.capitalize()} Density Distribution")
plt.subplot(222, frameon=True)
ax2 = sns.violinplot(x=data[x], palette="Accent")  # split=True requires a hue variable, so it is omitted here
plt.title(f"{x.capitalize()} Violinplot")
plt.subplot(223, frameon=True, sharex=ax1)
ax3 = sns.boxplot(
x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True
)
plt.title(f"{x.capitalize()} Boxplot")
plt.subplot(224, frameon=True, sharex=ax2)
ax4 = sns.kdeplot(data[x], cumulative=True)
plt.title(f"{x.capitalize()} Cumulative Density Distribution")
plt.show()
summary(data, "customer_age")
5 Point Summary of Customer_age Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 26 | 41 | 46 | 52 | 73 | +-------+-------+------+------+------+-------+
The data is approximately normally distributed, with only a couple of outliers on the higher end
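Observations like this can be made precise with Tukey's IQR fences, the same rule box plots use to flag outliers. A minimal sketch (the `iqr_outliers` helper and the toy ages series are illustrative, not part of the notebook):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Toy series; on the real frame this would be iqr_outliers(data["customer_age"])
ages = pd.Series([26, 41, 44, 46, 50, 52, 55, 73, 70])
print(iqr_outliers(ages).tolist())  # → [26, 73]
```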
summary(data, "dependent_count")
5 Point Summary of Dependent_count Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 1 | 2 | 3 | 5 | +-------+-------+------+------+------+-------+
`Dependent Count` is mostly 2 or 3
summary(data, "months_on_book")
5 Point Summary of Months_on_book Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 13 | 31 | 36 | 40 | 56 | +-------+-------+------+------+------+-------+
- Most customers have been on the books for about 3 years
- There are outliers on both the lower and higher ends
summary(data, "total_relationship_count")
5 Point Summary of Total_relationship_count Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 1 | 3 | 4 | 5 | 6 | +-------+-------+------+------+------+-------+
Most of the customers hold 4 or more products with the bank
summary(data, "months_inactive_12_mon")
5 Point Summary of Months_inactive_12_mon Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 2 | 2 | 3 | 6 | +-------+-------+------+------+------+-------+
- There are lower- and higher-end outliers for `Months inactive in last 12 months`
- Lower-end outliers are not concerning, since a value of 0 means the customer is always active; the customers who are inactive for 5 or more months are the ones to be concerned about
summary(data, "contacts_count_12_mon")
5 Point Summary of Contacts_count_12_mon Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 2 | 2 | 3 | 6 | +-------+-------+------+------+------+-------+
- Again, lower- and higher-end outliers are noticed
- A low number of contacts between the bank and the customer is worth investigating here
summary(data, "credit_limit")
5 Point Summary of Credit_limit Attribute: +-------+--------+------+------+---------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+--------+------+------+---------+-------| | Value | 1438.3 | 2555 | 4549 | 11067.5 | 34516 | +-------+--------+------+------+---------+-------+
There are higher-end outliers in `Credit Limit`. This might be because those customers are high-income, which we check against income and card category below.
data[data["credit_limit"] > 23000]["income_category"].value_counts(normalize=True)
| income_category | proportion |
|---|---|
| $80K - $120K | 0.421 |
| $120K + | 0.302 |
| $60K - $80K | 0.156 |
| Unknown | 0.110 |
| $40K - $60K | 0.012 |
| Less than $40K | 0.000 |
data[data["credit_limit"] > 23000]["card_category"].value_counts(normalize=True)
| card_category | proportion |
|---|---|
| Blue | 0.592 |
| Silver | 0.310 |
| Gold | 0.083 |
| Platinum | 0.015 |
Of the customers with a credit limit above 23K, ~88% earn $60K or more, and ~90% hold a Blue or Silver card
summary(data, "total_revolving_bal")
5 Point Summary of Total_revolving_bal Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 0 | 359 | 1276 | 1784 | 2517 | +-------+-------+------+------+------+-------+
A `Total revolving balance` of 0 means no balance carries over month to month — the customer either never uses the card or pays it off in full
summary(data, "avg_open_to_buy")
5 Point Summary of Avg_open_to_buy Attribute: +-------+-------+--------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+--------+------+------+-------| | Value | 3 | 1324.5 | 3474 | 9859 | 34516 | +-------+-------+--------+------+------+-------+
- `Average Open to Buy` has lots of higher-end outliers, meaning some customers use only a very small portion of their credit limit
- The data is right-skewed
summary(data, "total_amt_chng_q4_q1")
5 Point Summary of Total_amt_chng_q4_q1 Attribute: +-------+-------+-------+-------+-------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+-------+-------+-------+-------| | Value | 0 | 0.631 | 0.736 | 0.859 | 3.397 | +-------+-------+-------+-------+-------+-------+
Outliers are on both higher and lower end
summary(data, "total_trans_amt")
5 Point Summary of Total_trans_amt Attribute: +-------+-------+--------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+--------+------+------+-------| | Value | 510 | 2155.5 | 3899 | 4741 | 18484 | +-------+-------+--------+------+------+-------+
`Total Transaction Amount` has lots of higher-end outliers
summary(data, "total_trans_ct")
5 Point Summary of Total_trans_ct Attribute: +-------+-------+------+------+------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+------+------+------+-------| | Value | 10 | 45 | 67 | 81 | 139 | +-------+-------+------+------+------+-------+
summary(data, "total_ct_chng_q4_q1")
5 Point Summary of Total_ct_chng_q4_q1 Attribute: +-------+-------+-------+-------+-------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+-------+-------+-------+-------| | Value | 0 | 0.582 | 0.702 | 0.818 | 3.714 | +-------+-------+-------+-------+-------+-------+
Outliers are on both higher and lower end
summary(data, "avg_utilization_ratio")
5 Point Summary of Avg_utilization_ratio Attribute: +-------+-------+-------+-------+-------+-------+ | | Min | Q1 | Q2 | Q3 | Max | |-------+-------+-------+-------+-------+-------| | Value | 0 | 0.023 | 0.176 | 0.503 | 0.999 | +-------+-------+-------+-------+-------+-------+
`Average utilization` is right-skewed
For the categorical variables, it is best to analyze them as percentages of the total on bar charts. The function below takes category columns as input and plots bar charts with percentages on top of each bar.
# Below code plots grouped bar for each categorical feature
def perc_on_bar(data: pd.DataFrame, cat_columns, target, hue=None, perc=True):
'''
The function takes a list of category columns and the target column as input and plots a bar chart with percentages on top of each bar
Usage:
------
perc_on_bar(df, ['gender'], 'attrition_flag')
'''
subplot_cols = 2
subplot_rows = (len(cat_columns) + 1) // 2  # ceiling division so every column gets a subplot
plt.figure(figsize=(16,3*subplot_rows))
for i, col in enumerate(cat_columns):
plt.subplot(subplot_rows,subplot_cols,i+1)
order = data[col].value_counts(ascending=False).index # Data order
ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
for p in ax.patches:
percentage = '{:.1f}%\n({})'.format(100 * p.get_height()/len(data[target]), p.get_height())
# Added percentage and actual value
x = p.get_x() + p.get_width() / 2
y = p.get_y() + p.get_height() + 40
if perc:
plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium'); # Annotation on top of bars
plt.xticks(color='black', fontsize='medium', rotation= (-90 if col=='region' else 0));
plt.tight_layout()
plt.title(col.capitalize() + ' Percentage Bar Charts\n\n')
category_columns = data.select_dtypes(include="category").columns.tolist()
target_variable = "attrition_flag"
perc_on_bar(data, category_columns, target_variable)
- High imbalance in the data: the existing vs. attrited customer ratio is 84:16
- Data is almost equally distributed between males and females
- 31% of customers are Graduates
- ~85% of customers are either Single or Married, with 46.7% Married
- 35% of customers earn less than $40K and 36% earn $60K or more
- ~93% of customers have the Blue card
The goal of bivariate analysis is to find inter-dependencies between features.
# Below code plots a box plot of each numerical feature split by attrition status
def box_by_target(data: pd.DataFrame, numeric_columns, target, include_outliers):
"""
The function takes the numeric columns, the target column, and whether to include outliers as input,
and plots a box plot of each numeric column split by the target
Usage:
------
box_by_target(df, ['total_trans_ct'], 'attrition_flag', True)
"""
subplot_cols = 2
subplot_rows = (len(numeric_columns) + 1) // 2  # ceiling division
plt.figure(figsize=(16, 3 * subplot_rows))
for i, col in enumerate(numeric_columns):
plt.subplot(subplot_rows, subplot_cols, i + 1)
sns.boxplot(
data=data,
x=target,
y=col,
orient="vertical",
palette="Blues",
showfliers=include_outliers,
)
plt.tight_layout()
plt.title(str(i + 1) + ": " + target + " vs. " + col, color="black")
numeric_columns = data.select_dtypes(exclude="category").columns.tolist()
target_variable = "attrition_flag"
box_by_target(data, numeric_columns, target_variable, True)
box_by_target(data, numeric_columns, target_variable, False)
Attrited customers have:
- Lower total transaction amount
- Lower total transaction count
- Lower utilization ratio
- Lower transaction count change Q4 to Q1
- Higher number of contacts with or by the bank
# Create a function that returns a Pie chart and a Bar Graph for the categorical variables:
def cat_view(df: pd.DataFrame, x, target):
"""
Function to create a Bar chart and a Pie chart for categorical variables.
"""
from matplotlib import cm
color1 = cm.inferno(np.linspace(0.4, 0.8, 30))
color2 = cm.viridis(np.linspace(0.4, 0.8, 30))
sns.set_palette("cubehelix")
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
"""
Draw a Pie Chart on first subplot.
"""
s = data.groupby(x).size()
mydata_values = s.values.tolist()
mydata_index = s.index.tolist()
def func(pct, allvals):
absolute = int(pct / 100.0 * np.sum(allvals))
return "{:.1f}%\n({:d})".format(pct, absolute)
wedges, texts, autotexts = ax[0].pie(
mydata_values,
autopct=lambda pct: func(pct, mydata_values),
textprops=dict(color="w"),
)
ax[0].legend(
wedges,
mydata_index,
title=x.capitalize(),
loc="center left",
bbox_to_anchor=(1, 0, 0.5, 1),
)
plt.setp(autotexts, size=12)
ax[0].set_title(f"{x.capitalize()} Pie Chart")
"""
Draw a Bar Graph on second subplot.
"""
pivot = pd.pivot_table(
data, index=[x], columns=[target], values=["credit_limit"], aggfunc=len
)  # named pivot so it does not shadow the df parameter
labels = pivot.index.tolist()
no = pivot.values[:, 1].tolist()  # Existing Customer counts (columns follow category order)
yes = pivot.values[:, 0].tolist()  # Attrited Customer counts
l = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
rects1 = ax[1].bar(
l - width / 2, no, width, label="Existing Customer", color=color1
)
rects2 = ax[1].bar(
l + width / 2, yes, width, label="Attrited Customer", color=color2
)
# Add some text for labels, title and custom x-axis tick labels, etc.
ax[1].set_ylabel("Scores")
ax[1].set_title(f"{x.capitalize()} Bar Graph")
ax[1].set_xticks(l)
ax[1].set_xticklabels(labels)
ax[1].legend()
def autolabel(rects):
"""Attach a text label above each bar in *rects*, displaying its height."""
for rect in rects:
height = rect.get_height()
ax[1].annotate(
"{}".format(height),
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
fontsize="medium",
ha="center",
va="bottom",
)
autolabel(rects1)
autolabel(rects2)
fig.tight_layout()
plt.show()
"""
Draw a Stacked Bar Graph on bottom.
"""
sns.set(palette="tab10")
tab = pd.crosstab(data[x], data[target], normalize="index")
tab.plot.bar(stacked=True, figsize=(16, 3))
plt.title(x.capitalize() + " Stacked Bar Plot")
plt.legend(loc="upper right", bbox_to_anchor=(0, 1))
plt.show()
cat_view(data, "gender", "attrition_flag")
- Attrition does not seem to be related with Gender
cat_view(data, "education_level", "attrition_flag")
- Attrition does not seem to be related with Education
cat_view(data, "marital_status", "attrition_flag")
- Attrition does not seem to be related with Marital Status
cat_view(data, "income_category", "attrition_flag")
- Attrition does not seem to be related with Income Category
cat_view(data, "card_category", "attrition_flag")
`Platinum` card holders appear to have a higher attrition tendency; however, since there are only ~20 data points for Platinum card holders, this observation is unreliable
# Below pair plot shows pairwise relationships between the numerical features
# pairplot builds its own figure, so control the size via height instead of plt.figure
sns.set(palette="nipy_spectral")
sns.pairplot(data=data, hue="attrition_flag", corner=True, height=2)
<seaborn.axisgrid.PairGrid at 0x78fdb031e140>
- Clusters form with respect to attrition for `total revolving balance`, `total amount change Q4 to Q1`, `total transaction amount`, `total transaction count`, and `total transaction count change Q4 to Q1`
- There are strong correlations between a few columns as well, which we'll check in the correlation heatmap below
# Plotting correlation heatmap of the features
codes = {'Existing Customer':0, 'Attrited Customer':1}
data_clean = data.copy()
data_clean['attrition_flag'] = data_clean['attrition_flag'].map(codes).astype(int)
data_clean = data_clean.select_dtypes(include=[np.number])
sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
data_clean.corr(),
annot=True,
linewidths=0.5,
center=0,
cbar=False,
cmap="YlGnBu",
fmt="0.2f",
)
plt.show()
- `Credit Limit` and `Average Open to Buy` have 100% collinearity
- `Months on book` and `Customer Age` have quite a strong correlation
- `Average Utilization Ratio` and `Total Revolving Balance` also appear somewhat correlated
- `Attrition Flag` does not have a strong correlation with any of the numeric variables
- Customer churn appears uncorrelated with `Customer Age`, `Dependent Count`, `Months on Book`, `Open to Buy`, and `Credit Limit`; we'll remove these from the dataset
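The correlated pairs read off the heatmap can also be listed programmatically. A sketch (the `high_corr_pairs` helper and the toy frame are illustrative):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame: a and b are perfectly collinear, c is unrelated
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 1, 4, 2]})
print(high_corr_pairs(toy))  # only the (a, b) pair survives the 0.8 threshold
```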
Pre-processing steps:
- Drop the uninformative columns (Client Number, Customer Age, Dependent Count, Months on Book, Open to Buy, Credit Limit)
# Building a function to standardize column names
def feature_name_standardize(df: pd.DataFrame):
df_ = df.copy()
df_.columns = [i.replace(" ", "_").lower() for i in df_.columns]
return df_
# Building a function to drop features
def drop_feature(df: pd.DataFrame, features: list = []):
df_ = df.copy()
if len(features) != 0:
df_ = df_.drop(columns=features)
return df_
# Building a function to treat incorrect value
def mask_value(df: pd.DataFrame, feature: str = None, value_to_mask: str = None, masked_value: str = None):
df_ = df.copy()
if feature is not None and value_to_mask is not None:
if feature in df_.columns:
df_[feature] = df_[feature].astype('object')
df_.loc[df_[df_[feature] == value_to_mask].index, feature] = masked_value
df_[feature] = df_[feature].astype('category')
return df_
# Building a custom imputer
def impute_category_unknown(df: pd.DataFrame, fill_value: str):
df_ = df.copy()
for col in df_.select_dtypes(include='category').columns.tolist():
df_[col] = df_[col].astype('object')
df_[col] = df_[col].fillna(fill_value)
df_[col] = df_[col].astype('category')
return df_
# Building a custom data preprocessing class with fit and transform methods for standardizing column names
class FeatureNamesStandardizer(TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Returns dataframe with column names in lower case with underscores in place of spaces."""
X_ = feature_name_standardize(X)
return X_
# Building a custom data preprocessing class with fit and transform methods for dropping columns
class ColumnDropper(TransformerMixin):
def __init__(self, features: list):
self.features = features
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Given a list of columns, returns a dataframe without those columns."""
X_ = drop_feature(X, features=self.features)
return X_
# Building a custom data preprocessing class with fit and transform methods for custom value masking
class CustomValueMasker(TransformerMixin):
def __init__(self, feature: str, value_to_mask: str, masked_value: str):
self.feature = feature
self.value_to_mask = value_to_mask
self.masked_value = masked_value
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Return a dataframe with the required feature value masked as required."""
X_ = mask_value(X, self.feature, self.value_to_mask, self.masked_value)
return X_
# Building a custom class to one-hot encode using pandas
class PandasOneHot(TransformerMixin):
def __init__(self, columns: list = None):
self.columns = columns
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Return a dataframe with the categorical columns one-hot encoded via pandas get_dummies."""
X_ = pd.get_dummies(X, columns = self.columns, drop_first=True)
return X_
# Building a custom class to fill nulls with Unknown
class FillUnknown(TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Return a dataframe with categorical nulls filled with 'Unknown'."""
X_ = impute_category_unknown(X, fill_value='Unknown')
return X_
First we'll build models individually after data pre-processing; later we'll build an ML pipeline to run the end-to-end process of pre-processing and model building. We are creating a copy of the data for the first part.
df = churner.copy()
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | NaN | NaN | NaN | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.000 | NaN | NaN | NaN | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.000 | NaN | NaN | NaN | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.000 | NaN | NaN | NaN | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | NaN | NaN | NaN | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | NaN | NaN | NaN | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | NaN | NaN | NaN | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | NaN | NaN | NaN | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | NaN | NaN | NaN | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | NaN | NaN | NaN | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | NaN | NaN | NaN | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | NaN | NaN | NaN | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | NaN | NaN | NaN | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
# The static variables
# For dropping columns
columns_to_drop = [
"clientnum",
"credit_limit",
"dependent_count",
"months_on_book",
"avg_open_to_buy",
"customer_age",
]
# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"
# Random state and loss
seed = 1
loss_func = "logloss"
# Test and Validation sizes
test_size = 0.2
val_size = 0.25
# Dependent variable value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}
Here we convert the object columns to the category data type
cat_columns = df.select_dtypes(include="object").columns.tolist()
df[cat_columns] = df[cat_columns].astype("category")
Splitting the dataset into dependent and independent variable sets
X = df.drop(columns=["Attrition_Flag"])
y = df["Attrition_Flag"].map(target_mapper)
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=test_size, random_state=seed, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(
"Training data shape: \n\n",
X_train.shape,
"\n\nValidation Data Shape: \n\n",
X_val.shape,
"\n\nTesting Data Shape: \n\n",
X_test.shape,
)
Training data shape: (6075, 20) Validation Data Shape: (2026, 20) Testing Data Shape: (2026, 20)
Checking the ratio of labels in the target column for each of the data segments
print("Training: \n", y_train.value_counts(normalize=True))
print("\n\nValidation: \n", y_val.value_counts(normalize=True))
print("\n\nTest: \n", y_test.value_counts(normalize=True))
Training: Attrition_Flag 0 0.839 1 0.161 Name: proportion, dtype: float64 Validation: Attrition_Flag 0 0.839 1 0.161 Name: proportion, dtype: float64 Test: Attrition_Flag 0 0.840 1 0.160 Name: proportion, dtype: float64
Data pre-processing is one of the most important parts of the job before training the model. We need to impute missing values, fix any illogical values in columns, convert category columns to numeric (either ordinal, or binary using one-hot encoding), and scale the data to deal with skewness and outliers, before feeding the data to a model.
We use the pre-built transformation classes and the custom classes we created: first fit on the training data, then transform the train, validation, and test datasets. This is standard practice to keep the influence of validation and test data out of the fitted transformers, preventing data leakage while training or validating the model.
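As a tiny illustration of the fit-on-train-only rule: a scaler learns its statistics from the training split alone and reuses them unchanged on other splits (toy arrays; illustrative only):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# The scaler's IQR is learned from the training data only;
# validation/test are scaled with the *training* statistics
train = np.array([[1.0], [2.0], [3.0], [100.0]])
val = np.array([[2.0], [50.0]])
scaler = RobustScaler(with_centering=False).fit(train)
print(scaler.scale_)          # IQR of the training column: [25.5]
print(scaler.transform(val))  # val divided by the training IQR
```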
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()
X_train = feature_name_standardizer.fit_transform(X_train)
X_val = feature_name_standardizer.transform(X_val)
X_test = feature_name_standardizer.transform(X_test)
# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)
X_train = column_dropper.fit_transform(X_train)
X_val = column_dropper.transform(X_val)
X_test = column_dropper.transform(X_test)
# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)
X_train = value_masker.fit_transform(X_train)
X_val = value_masker.transform(X_val)
X_test = value_masker.transform(X_test)
# To impute categorical Nulls to Unknown
cat_columns = X_train.select_dtypes(include="category").columns.tolist()
imputer = FillUnknown()
X_train[cat_columns] = imputer.fit_transform(X_train[cat_columns])
X_val[cat_columns] = imputer.transform(X_val[cat_columns])
X_test[cat_columns] = imputer.transform(X_test[cat_columns])
# To encode the data
one_hot = PandasOneHot()
X_train = one_hot.fit_transform(X_train)
X_val = one_hot.transform(X_val)
X_test = one_hot.transform(X_test)
# Scale the numerical columns
robust_scaler = RobustScaler(with_centering=False, with_scaling=True)
num_columns = [
"total_relationship_count",
"months_inactive_12_mon",
"contacts_count_12_mon",
"total_revolving_bal",
"total_amt_chng_q4_q1",
"total_trans_amt",
"total_trans_ct",
"total_ct_chng_q4_q1",
"avg_utilization_ratio",
]
X_train[num_columns] = pd.DataFrame(
robust_scaler.fit_transform(X_train[num_columns]),
columns=num_columns,
index=X_train.index,
)
X_val[num_columns] = pd.DataFrame(
robust_scaler.transform(X_val[num_columns]), columns=num_columns, index=X_val.index
)
X_test[num_columns] = pd.DataFrame(
robust_scaler.transform(X_test[num_columns]),
columns=num_columns,
index=X_test.index,
)
X_train.head(3)
| total_relationship_count | months_inactive_12_mon | contacts_count_12_mon | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_trans_ct | total_ct_chng_q4_q1 | avg_utilization_ratio | gender_M | education_level_Doctorate | education_level_Graduate | education_level_High School | education_level_Post-Graduate | education_level_Uneducated | education_level_Unknown | marital_status_Married | marital_status_Single | marital_status_Unknown | income_category_$40K - $60K | income_category_$60K - $80K | income_category_$80K - $120K | income_category_Less than $40K | income_category_Unknown | card_category_Gold | card_category_Platinum | card_category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 3.000 | 4.000 | 3.000 | 1.226 | 2.044 | 0.648 | 1.278 | 2.249 | 0.168 | True | False | False | False | False | False | True | False | True | False | False | False | False | False | False | False | False | False |
| 498 | 3.000 | 2.000 | 0.000 | 1.450 | 1.697 | 0.524 | 0.861 | 2.667 | 1.376 | True | False | False | False | False | False | True | True | False | False | False | False | False | False | True | False | False | False |
| 4356 | 2.500 | 1.000 | 2.000 | 1.926 | 3.829 | 1.661 | 2.194 | 3.717 | 0.775 | True | False | False | True | False | False | False | True | False | False | False | False | True | False | False | False | False | False |
X_val.head(3)
| total_relationship_count | months_inactive_12_mon | contacts_count_12_mon | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_trans_ct | total_ct_chng_q4_q1 | avg_utilization_ratio | gender_M | education_level_Doctorate | education_level_Graduate | education_level_High School | education_level_Post-Graduate | education_level_Uneducated | education_level_Unknown | marital_status_Married | marital_status_Single | marital_status_Unknown | income_category_$40K - $60K | income_category_$60K - $80K | income_category_$80K - $120K | income_category_Less than $40K | income_category_Unknown | card_category_Gold | card_category_Platinum | card_category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2894 | 2.500 | 2.000 | 3.000 | 0.000 | 5.083 | 1.148 | 1.528 | 4.068 | 0.000 | True | False | False | False | True | False | False | False | True | False | False | False | True | False | False | False | False | False |
| 9158 | 0.500 | 3.000 | 1.000 | 0.000 | 3.982 | 3.148 | 1.639 | 3.810 | 0.000 | True | False | False | False | False | True | False | False | True | False | False | False | True | False | False | False | False | False |
| 9618 | 1.500 | 4.000 | 3.000 | 1.584 | 3.860 | 5.291 | 2.833 | 2.300 | 0.126 | True | False | False | False | False | True | False | True | False | False | False | False | False | False | False | False | True | False |
X_test.head(3)
| total_relationship_count | months_inactive_12_mon | contacts_count_12_mon | total_revolving_bal | total_amt_chng_q4_q1 | total_trans_amt | total_trans_ct | total_ct_chng_q4_q1 | avg_utilization_ratio | gender_M | education_level_Doctorate | education_level_Graduate | education_level_High School | education_level_Post-Graduate | education_level_Uneducated | education_level_Unknown | marital_status_Married | marital_status_Single | marital_status_Unknown | income_category_$40K - $60K | income_category_$60K - $80K | income_category_$80K - $120K | income_category_Less than $40K | income_category_Unknown | card_category_Gold | card_category_Platinum | card_category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9760 | 1.000 | 3.000 | 2.000 | 0.865 | 3.316 | 5.556 | 2.583 | 2.544 | 0.369 | True | False | False | True | False | False | False | False | True | False | False | False | True | False | False | False | False | False |
| 7413 | 2.000 | 3.000 | 2.000 | 0.000 | 3.219 | 0.850 | 1.139 | 2.190 | 0.000 | True | False | False | False | True | False | False | False | True | False | False | True | False | False | False | False | False | False |
| 6074 | 1.500 | 3.000 | 3.000 | 0.000 | 3.237 | 1.658 | 2.056 | 3.215 | 0.000 | False | False | False | True | False | False | False | True | False | False | True | False | False | False | False | False | False | False |
print(
"Training data shape: \n\n",
X_train.shape,
"\n\nValidation Data Shape: \n\n",
X_val.shape,
"\n\nTesting Data Shape: \n\n",
X_test.shape,
)
Training data shape: (6075, 27) Validation Data Shape: (2026, 27) Testing Data Shape: (2026, 27)
We are now all set to build, train and validate the model
Let's start by building different models using KFold and cross_val_score, and tune the best model using RandomizedSearchCV.
Stratified K-Folds cross-validation provides dataset indices to split the data into train/validation sets. It splits the dataset into k consecutive folds (without shuffling by default), keeping the class distribution of each fold the same as that of the target variable. Each fold is then used once for validation while the remaining k - 1 folds form the training set.
We are creating a few functions to score the models and show the confusion matrix.
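The stratified CV scheme just described can be sketched on a toy imbalanced dataset (all names here are illustrative; the real runs would pass X_train/y_train and the chosen estimators):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy 84:16 imbalanced data standing in for the churn dataset
X_toy, y_toy = make_classification(n_samples=500, weights=[0.84], random_state=1)

# Each of the 5 folds preserves the 84:16 class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=1), X_toy, y_toy, cv=cv, scoring="recall"
)
print(scores.mean())
```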
def get_metrics_score(
model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True
):
"""
Function to calculate different metric scores of the model - Accuracy, Recall, Precision, F1, and ROC-AUC
model: classifier to predict values of X
train, test: independent features
train_y, test_y: dependent variable
threshold: threshold for classifying an observation as 1
flag: if True, print the individual metric scores. The default is False.
roc: if True, also compute the ROC-AUC score. The default is True.
"""
# defining an empty list to store train and test results
score_list = []
pred_train = model.predict_proba(train)[:, 1] > threshold
pred_test = model.predict_proba(test)[:, 1] > threshold
pred_train = np.round(pred_train)
pred_test = np.round(pred_test)
train_acc = accuracy_score(pred_train, train_y)
test_acc = accuracy_score(pred_test, test_y)
train_recall = recall_score(train_y, pred_train)
test_recall = recall_score(test_y, pred_test)
train_precision = precision_score(train_y, pred_train)
test_precision = precision_score(test_y, pred_test)
train_f1 = f1_score(train_y, pred_train)
test_f1 = f1_score(test_y, pred_test)
pred_train_proba = model.predict_proba(train)[:, 1]
pred_test_proba = model.predict_proba(test)[:, 1]
train_roc_auc = roc_auc_score(train_y, pred_train_proba)
test_roc_auc = roc_auc_score(test_y, pred_test_proba)
score_list.extend(
(
train_acc,
test_acc,
train_recall,
test_recall,
train_precision,
test_precision,
train_f1,
test_f1,
train_roc_auc,
test_roc_auc,
)
)
if flag == True:
print("Accuracy on training set : ", accuracy_score(pred_train, train_y))
print("Accuracy on test set : ", accuracy_score(pred_test, test_y))
print("Recall on training set : ", recall_score(train_y, pred_train))
print("Recall on test set : ", recall_score(test_y, pred_test))
print("Precision on training set : ", precision_score(train_y, pred_train))
print("Precision on test set : ", precision_score(test_y, pred_test))
print("F1 on training set : ", f1_score(train_y, pred_train))
print("F1 on test set : ", f1_score(test_y, pred_test))
if roc == True:
if flag == True:
print(
"ROC-AUC Score on training set : ",
roc_auc_score(train_y, pred_train_proba),
)
print(
"ROC-AUC Score on test set : ", roc_auc_score(test_y, pred_test_proba)
)
return score_list # returning the list with train and test scores
def make_confusion_matrix(model, test_X, y_actual, labels=[1, 0]):
"""
Function to plot the confusion matrix as an annotated heatmap
model: classifier used to predict values of X
test_X: test set
y_actual: ground truth
labels: class labels, positive class first
"""
y_predict = model.predict(test_X)
cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
df_cm = pd.DataFrame(
cm,
index=[i for i in ["Actual - Attrited", "Actual - Existing"]],
columns=[i for i in ["Predicted - Attrited", "Predicted - Existing"]],
)
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
plt.figure(figsize=(5, 3))
sns.heatmap(df_cm, annot=labels, fmt="", cmap="Blues").set(title="Confusion Matrix")
# # defining empty lists to add train and test results
model_names = []
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
roc_auc_train = []
roc_auc_test = []
cross_val_train = []
def add_score_model(model_name, score, cv_res):
"""Add scores to list so that we can compare all models score together"""
model_names.append(model_name)
acc_train.append(score[0])
acc_test.append(score[1])
recall_train.append(score[2])
recall_test.append(score[3])
precision_train.append(score[4])
precision_test.append(score[5])
f1_train.append(score[6])
f1_test.append(score[7])
roc_auc_train.append(score[8])
roc_auc_test.append(score[9])
cross_val_train.append(cv_res)
We are building 7 models here: Bagging, Random Forest, Gradient Boosting (GBM), AdaBoost, Extreme Gradient Boosting (XGBoost), Decision Tree, and Light Gradient Boosting (Light GBM).
Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithm, used for ranking, classification and many other machine learning tasks.
Since it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas most other boosting algorithms split the tree depth-wise or level-wise. When growing the same tree, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which often results in better accuracy. Below is a diagrammatic representation by the makers of Light GBM that explains the difference clearly.
Source: towards data science
models = [] # Empty list to store all the models
cv_results = []
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=seed)))
models.append(("Random forest", RandomForestClassifier(random_state=seed)))
models.append(("GBM", GradientBoostingClassifier(random_state=seed)))
models.append(("Adaboost", AdaBoostClassifier(random_state=seed)))
models.append(("Xgboost", XGBClassifier(random_state=seed, eval_metric=loss_func)))
models.append(("dtree", DecisionTreeClassifier(random_state=seed)))
models.append(("Light GBM", lgb.LGBMClassifier(random_state=seed)))
# For each model, run 10-fold stratified cross validation with recall as the scoring metric
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=1
) # Setting number of splits equal to 10
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
cv_results.append(cv_result)
model.fit(X_train, y_train)
model_score = get_metrics_score(model, X_train, X_val, y_train, y_val)
add_score_model(name, model_score, cv_result.mean())
print("Operation Completed!")
Operation Completed!
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
| Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | Light GBM | 0.851431 | 0.999012 | 0.969891 | 0.998975 | 0.880368 | 0.994898 | 0.928803 | 0.996933 | 0.903937 | 0.999996 | 0.993546 |
| 4 | Xgboost | 0.849379 | 0.999835 | 0.966436 | 1.000000 | 0.874233 | 0.998976 | 0.913462 | 0.999488 | 0.893417 | 1.000000 | 0.993060 |
| 2 | GBM | 0.817620 | 0.969712 | 0.969398 | 0.873975 | 0.874233 | 0.933260 | 0.931373 | 0.902646 | 0.901899 | 0.992689 | 0.989937 |
| 3 | Adaboost | 0.799137 | 0.956379 | 0.961007 | 0.830943 | 0.849693 | 0.890231 | 0.902280 | 0.859565 | 0.875197 | 0.987073 | 0.979432 |
| 0 | Bagging | 0.785862 | 0.996049 | 0.954590 | 0.980533 | 0.822086 | 0.994802 | 0.887417 | 0.987616 | 0.853503 | 0.999899 | 0.978021 |
| 1 | Random forest | 0.770440 | 1.000000 | 0.959526 | 1.000000 | 0.812883 | 1.000000 | 0.926573 | 1.000000 | 0.866013 | 1.000000 | 0.983956 |
| 5 | dtree | 0.754113 | 1.000000 | 0.937315 | 1.000000 | 0.806748 | 1.000000 | 0.804281 | 1.000000 | 0.805513 | 1.000000 | 0.884551 |
- The best model with respect to cross validation score and test recall is Light GBM
- The next best models are XGBoost, GBM, and AdaBoost, respectively
We are plotting the cross validation results for the 7 models in a box plot to check which models are potentially good.
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(cv_results)
ax.set_xticklabels(model_names)
plt.show()
It appears Light GBM, XGBoost, and GBM are the models with good potential. AdaBoost also looks promising, given its higher-end outlier scores.
Our dataset has a large imbalance in the target variable labels. To deal with such datasets, we can draw on a set of techniques known as imbalanced classification.
Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance.
The challenge of working with imbalanced datasets is that most machine learning techniques tend to ignore the minority class and, in turn, perform poorly on it, even though performance on the minority class is typically what matters most, as is the case in our study here.
One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
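The core SMOTE idea can be sketched with toy data (this illustrates the interpolation mechanism, not imblearn's exact implementation): a synthetic point is a random interpolation between a minority sample and one of its nearest minority-class neighbours.

```python
# Sketch of how SMOTE synthesizes a minority-class point.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(loc=5.0, size=(20, 2))  # toy minority-class points

nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, idx = nn.kneighbors(minority)

i = 0                            # pick one minority sample
j = idx[i, rng.integers(1, 3)]   # one of its neighbours (index 0 is itself)
gap = rng.random()               # interpolation factor in [0, 1)
synthetic = minority[i] + gap * (minority[j] - minority[i])
print(synthetic)  # lies on the segment between the two real points
```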
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy="minority", k_neighbors=10, random_state=seed
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 976 Before UpSampling, counts of label 'No': 5099 After UpSampling, counts of label 'Yes': 5099 After UpSampling, counts of label 'No': 5099 After UpSampling, the shape of train_X: (10198, 27) After UpSampling, the shape of train_y: (10198,)
We are building and training the same 7 models as before. However, this time we use the over-sampled training data to train them.
models_over = []
# Appending models into the list
models_over.append(("Bagging UpSampling", BaggingClassifier(random_state=seed)))
models_over.append(
("Random forest UpSampling", RandomForestClassifier(random_state=seed))
)
models_over.append(("GBM UpSampling", GradientBoostingClassifier(random_state=seed)))
models_over.append(("Adaboost UpSampling", AdaBoostClassifier(random_state=seed)))
models_over.append(
("Xgboost UpSampling", XGBClassifier(random_state=seed, eval_metric=loss_func))
)
models_over.append(("dtree UpSampling", DecisionTreeClassifier(random_state=seed)))
models_over.append(("Light GBM UpSampling", lgb.LGBMClassifier(random_state=seed)))
for name, model in models_over:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=1
) # Setting number of splits equal to 10
cv_result_over = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
cv_results.append(cv_result_over)
model.fit(X_train_over, y_train_over)
model_score_over = get_metrics_score(
model, X_train_over, X_val, y_train_over, y_val
)
add_score_model(name, model_score_over, cv_result_over.mean())
print("Operation Completed!")
Operation Completed!
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
| Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | Light GBM UpSampling | 0.979799 | 0.997254 | 0.964956 | 0.998823 | 0.917178 | 0.995699 | 0.871720 | 0.997259 | 0.893871 | 0.999973 | 0.992508 |
| 9 | GBM UpSampling | 0.965483 | 0.970484 | 0.957552 | 0.975093 | 0.917178 | 0.966187 | 0.835196 | 0.970620 | 0.874269 | 0.995988 | 0.988800 |
| 10 | Adaboost UpSampling | 0.956068 | 0.952932 | 0.941264 | 0.960188 | 0.914110 | 0.946453 | 0.766067 | 0.953271 | 0.833566 | 0.991303 | 0.983562 |
| 11 | Xgboost UpSampling | 0.979604 | 0.999902 | 0.966930 | 1.000000 | 0.898773 | 0.999804 | 0.896024 | 0.999902 | 0.897397 | 1.000000 | 0.992938 |
| 8 | Random forest UpSampling | 0.981761 | 1.000000 | 0.956565 | 1.000000 | 0.895706 | 1.000000 | 0.843931 | 1.000000 | 0.869048 | 1.000000 | 0.985522 |
| 6 | Light GBM | 0.851431 | 0.999012 | 0.969891 | 0.998975 | 0.880368 | 0.994898 | 0.928803 | 0.996933 | 0.903937 | 0.999996 | 0.993546 |
| 4 | Xgboost | 0.849379 | 0.999835 | 0.966436 | 1.000000 | 0.874233 | 0.998976 | 0.913462 | 0.999488 | 0.893417 | 1.000000 | 0.993060 |
| 2 | GBM | 0.817620 | 0.969712 | 0.969398 | 0.873975 | 0.874233 | 0.933260 | 0.931373 | 0.902646 | 0.901899 | 0.992689 | 0.989937 |
| 7 | Bagging UpSampling | 0.959602 | 0.996960 | 0.943731 | 0.996862 | 0.861963 | 0.997058 | 0.802857 | 0.996960 | 0.831361 | 0.999969 | 0.973466 |
| 3 | Adaboost | 0.799137 | 0.956379 | 0.961007 | 0.830943 | 0.849693 | 0.890231 | 0.902280 | 0.859565 | 0.875197 | 0.987073 | 0.979432 |
| 0 | Bagging | 0.785862 | 0.996049 | 0.954590 | 0.980533 | 0.822086 | 0.994802 | 0.887417 | 0.987616 | 0.853503 | 0.999899 | 0.978021 |
| 12 | dtree UpSampling | 0.945871 | 1.000000 | 0.923001 | 1.000000 | 0.819018 | 1.000000 | 0.733516 | 1.000000 | 0.773913 | 1.000000 | 0.880980 |
| 1 | Random forest | 0.770440 | 1.000000 | 0.959526 | 1.000000 | 0.812883 | 1.000000 | 0.926573 | 1.000000 | 0.866013 | 1.000000 | 0.983956 |
| 5 | dtree | 0.754113 | 1.000000 | 0.937315 | 1.000000 | 0.806748 | 1.000000 | 0.804281 | 1.000000 | 0.805513 | 1.000000 | 0.884551 |
- The best 4 models with respect to validation recall and cross validation score are as follows:
- Light GBM trained with over/up-sampled data
- GBM trained with over/up-sampled data
- AdaBoost trained with over/up-sampled data
- XGBoost trained with over/up-sampled data
Undersampling is another way of dealing with imbalance in the dataset.
Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset until a balanced dataset is created.
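The mechanism can be sketched with NumPy on hypothetical toy labels (RandomUnderSampler applies the same idea): keep all minority rows and draw an equal number of majority rows without replacement.

```python
# Sketch of random undersampling: balance classes by discarding majority rows.
import numpy as np

rng = np.random.default_rng(1)
y_toy = np.array([0] * 100 + [1] * 20)  # imbalanced toy labels
maj_idx = np.flatnonzero(y_toy == 0)
min_idx = np.flatnonzero(y_toy == 1)

keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
balanced_idx = np.concatenate([keep_maj, min_idx])
print(np.bincount(y_toy[balanced_idx]))  # → [20 20]
```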
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976 Before Under Sampling, counts of label 'No': 5099 After Under Sampling, counts of label 'Yes': 976 After Under Sampling, counts of label 'No': 976 After Under Sampling, the shape of train_X: (1952, 27) After Under Sampling, the shape of train_y: (1952,)
We are again building the same 7 models as before, training them on the undersampled dataset, and using the validation dataset to score them.
models_under = []
# Appending models into the list
models_under.append(("Bagging DownSampling", BaggingClassifier(random_state=seed)))
models_under.append(
("Random forest DownSampling", RandomForestClassifier(random_state=seed))
)
models_under.append(("GBM DownSampling", GradientBoostingClassifier(random_state=seed)))
models_under.append(("Adaboost DownSampling", AdaBoostClassifier(random_state=seed)))
models_under.append(
("Xgboost DownSampling", XGBClassifier(random_state=seed, eval_metric=loss_func))
)
models_under.append(("dtree DownSampling", DecisionTreeClassifier(random_state=seed)))
models_under.append(("Light GBM DownSampling", lgb.LGBMClassifier(random_state=seed)))
for name, model in models_under:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=1
) # Setting number of splits equal to 10
cv_result_under = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
cv_results.append(cv_result_under)
model.fit(X_train_un, y_train_un)
model_score_under = get_metrics_score(model, X_train_un, X_val, y_train_un, y_val)
add_score_model(name, model_score_under, cv_result_under.mean())
print("Operation Completed!")
Operation Completed!
comparison_frame = pd.DataFrame(
    {
        "Model": model_names,
        "Cross_Val_Score_Train": cross_val_train,
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
        "Train_F1": f1_train,
        "Test_F1": f1_test,
        "Train_ROC_AUC": roc_auc_train,
        "Test_ROC_AUC": roc_auc_test,
    }
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
    by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
| | Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | Adaboost DownSampling | 0.925174 | 0.947746 | 0.936328 | 0.952869 | 0.963190 | 0.943205 | 0.728538 | 0.948012 | 0.829590 | 0.989348 | 0.985150 |
| 18 | Xgboost DownSampling | 0.947696 | 1.000000 | 0.938302 | 1.000000 | 0.960123 | 1.000000 | 0.736471 | 1.000000 | 0.833555 | 1.000000 | 0.989587 |
| 20 | Light GBM DownSampling | 0.953871 | 1.000000 | 0.939783 | 1.000000 | 0.957055 | 1.000000 | 0.742857 | 1.000000 | 0.836461 | 1.000000 | 0.991133 |
| 16 | GBM DownSampling | 0.951799 | 0.967725 | 0.938796 | 0.979508 | 0.957055 | 0.956957 | 0.739336 | 0.968101 | 0.834225 | 0.995357 | 0.989747 |
| 15 | Random forest DownSampling | 0.935388 | 1.000000 | 0.928430 | 1.000000 | 0.932515 | 1.000000 | 0.711944 | 1.000000 | 0.807437 | 1.000000 | 0.979840 |
| 14 | Bagging DownSampling | 0.920029 | 0.994365 | 0.924482 | 0.990779 | 0.932515 | 0.997936 | 0.698851 | 0.994344 | 0.798949 | 0.999701 | 0.972970 |
| 13 | Light GBM UpSampling | 0.979799 | 0.997254 | 0.964956 | 0.998823 | 0.917178 | 0.995699 | 0.871720 | 0.997259 | 0.893871 | 0.999973 | 0.992508 |
| 9 | GBM UpSampling | 0.965483 | 0.970484 | 0.957552 | 0.975093 | 0.917178 | 0.966187 | 0.835196 | 0.970620 | 0.874269 | 0.995988 | 0.988800 |
| 10 | Adaboost UpSampling | 0.956068 | 0.952932 | 0.941264 | 0.960188 | 0.914110 | 0.946453 | 0.766067 | 0.953271 | 0.833566 | 0.991303 | 0.983562 |
| 11 | Xgboost UpSampling | 0.979604 | 0.999902 | 0.966930 | 1.000000 | 0.898773 | 0.999804 | 0.896024 | 0.999902 | 0.897397 | 1.000000 | 0.992938 |
| 8 | Random forest UpSampling | 0.981761 | 1.000000 | 0.956565 | 1.000000 | 0.895706 | 1.000000 | 0.843931 | 1.000000 | 0.869048 | 1.000000 | 0.985522 |
| 19 | dtree DownSampling | 0.896423 | 1.000000 | 0.891412 | 1.000000 | 0.886503 | 1.000000 | 0.612288 | 1.000000 | 0.724311 | 1.000000 | 0.889428 |
| 6 | Light GBM | 0.851431 | 0.999012 | 0.969891 | 0.998975 | 0.880368 | 0.994898 | 0.928803 | 0.996933 | 0.903937 | 0.999996 | 0.993546 |
| 4 | Xgboost | 0.849379 | 0.999835 | 0.966436 | 1.000000 | 0.874233 | 0.998976 | 0.913462 | 0.999488 | 0.893417 | 1.000000 | 0.993060 |
| 2 | GBM | 0.817620 | 0.969712 | 0.969398 | 0.873975 | 0.874233 | 0.933260 | 0.931373 | 0.902646 | 0.901899 | 0.992689 | 0.989937 |
| 7 | Bagging UpSampling | 0.959602 | 0.996960 | 0.943731 | 0.996862 | 0.861963 | 0.997058 | 0.802857 | 0.996960 | 0.831361 | 0.999969 | 0.973466 |
| 3 | Adaboost | 0.799137 | 0.956379 | 0.961007 | 0.830943 | 0.849693 | 0.890231 | 0.902280 | 0.859565 | 0.875197 | 0.987073 | 0.979432 |
| 0 | Bagging | 0.785862 | 0.996049 | 0.954590 | 0.980533 | 0.822086 | 0.994802 | 0.887417 | 0.987616 | 0.853503 | 0.999899 | 0.978021 |
| 12 | dtree UpSampling | 0.945871 | 1.000000 | 0.923001 | 1.000000 | 0.819018 | 1.000000 | 0.733516 | 1.000000 | 0.773913 | 1.000000 | 0.880980 |
| 1 | Random forest | 0.770440 | 1.000000 | 0.959526 | 1.000000 | 0.812883 | 1.000000 | 0.926573 | 1.000000 | 0.866013 | 1.000000 | 0.983956 |
| 5 | dtree | 0.754113 | 1.000000 | 0.937315 | 1.000000 | 0.806748 | 1.000000 | 0.804281 | 1.000000 | 0.805513 | 1.000000 | 0.884551 |
- The 4 best models are:
- XGBoost trained with undersampled data
- AdaBoost trained with undersampled data
- Light GBM trained with undersampled data
- GBM trained with undersampled data
We will now tune these 4 models using RandomizedSearchCV.
XGBoost with down-sampling has the best validation recall (96.3%), along with a 95% cross-validation score on train and 0.99 AUC, which means it has a high chance of performing well on unseen data. There is a bit of over-fitting, which I expect tuning to resolve.
AdaBoost generalizes very well: it is neither over-fitting nor biased, AUC is 0.985, the cross-validation score on train is 93%, and its validation recall matches XGBoost (96.3%). I expect to improve the model (~94% accuracy on the validation set) via tuning.
Light GBM works well in all aspects, but has a slight over-fitting problem, which I expect tuning to resolve. Validation accuracy is 94%, the cross-validation score on train is 95%, validation recall is ~96%, and AUC is 0.99. This looks like a very promising model.
GBM is not over-fitting, and neither is it suffering from high bias or variance. Validation recall is ~96%, validation accuracy ~94%, AUC is ~0.99, and the cross-validation score on train is ~95%. This would be my top choice because none of the training scores are 100%, meaning it is not over-fitting by trying to explain every single aspect of the training data.
Typically a hyperparameter has a known effect on a model in the general sense, but it is not clear how to best set a hyperparameter for a given dataset. Further, many machine learning models have a range of hyperparameters and they may interact in nonlinear ways.
As such, it is often required to search for a set of hyperparameters that result in the best performance of a model on a dataset. This is called hyperparameter optimization, hyperparameter tuning, or hyperparameter search.
An optimization procedure involves defining a search space. This can be thought of geometrically as an n-dimensional volume, where each hyperparameter represents a different dimension and the scale of the dimension is the set of values that the hyperparameter may take on, such as real-valued, integer-valued, or categorical.
Search Space: The volume to be searched, where each dimension represents a hyperparameter and each point represents one model configuration. A point in the search space is a vector with a specific value for each hyperparameter. The goal of the optimization procedure is to find the vector that results in the best performance of the model after learning, such as maximum accuracy or minimum error.
A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.
Random Search: Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.
Grid Search: Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
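To make the two strategies concrete, the sketch below (using a throw-away decision tree on synthetic data, not this project's models) shows that grid search enumerates every combination in the space while random search only evaluates a fixed budget of sampled points:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=1)
space = {"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 5, 10]}

# Grid search evaluates every point in the grid: 4 * 3 = 12 candidates
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), space, cv=3).fit(X, y)

# Random search samples a fixed number of points from the same space
rand = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1), space, n_iter=5, cv=3, random_state=1
).fit(X, y)

print(len(grid.cv_results_["params"]), len(rand.cv_results_["params"]))  # 12 5
```

With many hyperparameters, the grid grows multiplicatively while `n_iter` stays fixed, which is why random search is used for the larger spaces below.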
%%time
# Defining the model
model = XGBClassifier(random_state=seed, eval_metric=loss_func)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 500, 50),
    "scale_pos_weight": [2, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3, 5],
    "subsample": [0.8, 0.9, 1],
    "max_depth": np.arange(4, 20, 1),
    "reg_lambda": [5, 10, 15, 20],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
xgb_tuned = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=10,
    random_state=seed,
    n_jobs=-1,
)

# Fitting RandomizedSearchCV on the down-sampled training data
xgb_tuned.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:".format(xgb_tuned.best_params_, xgb_tuned.best_score_))
Best parameters are {'subsample': 1, 'scale_pos_weight': 10, 'reg_lambda': 10, 'n_estimators': 50, 'max_depth': 11, 'learning_rate': 0.01, 'gamma': 3} with CV score=1.0:
CPU times: user 3.72 s, sys: 372 ms, total: 4.09 s
Wall time: 2min 37s
# Building the model with tuned parameters (adjusted manually from the search result)
xgb_tuned_model = XGBClassifier(
n_estimators=150,
scale_pos_weight=10,
subsample=1,
reg_lambda=20,
max_depth=5,
learning_rate=0.01,
gamma=0,
eval_metric=loss_func,
random_state=seed,
)
# Fit the model on training data
xgb_tuned_model.fit(X_train_un, y_train_un)
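Note that `scale_pos_weight` is conventionally set to the negative/positive count ratio of the training labels; on the balanced down-sampled set that ratio is 1, so the value 10 used above additionally up-weights attrited customers to push recall higher. A quick check of the ratio, using class counts taken from the fold logs earlier (the label array here is a hypothetical stand-in for `y_train_un`):

```python
import numpy as np

# Balanced down-sampled labels, counts as reported in the fold logs
y_train_un_demo = np.array([0] * 878 + [1] * 878)
neg, pos = np.bincount(y_train_un_demo)
print(neg / pos)  # 1.0
```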
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=0, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=5,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=150,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
xgb_tuned_model_score = get_metrics_score(
xgb_tuned_model, X_train, X_val, y_train, y_val
)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
xgb_down_cv = cross_val_score(
estimator=xgb_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
add_score_model(
"XGB Tuned with Down Sampling", xgb_tuned_model_score, xgb_down_cv.mean()
)
make_confusion_matrix(xgb_tuned_model, X_val, y_val)
%%time
# Defining the model
model = AdaBoostClassifier(random_state=seed)

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 2000, 50),
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
ada_tuned = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=10,
    random_state=seed,
    n_jobs=-1,
)

# Fitting RandomizedSearchCV on the down-sampled training data
ada_tuned.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:".format(ada_tuned.best_params_, ada_tuned.best_score_))
Best parameters are {'n_estimators': 1050, 'learning_rate': 0.1} with CV score=0.9405743740795287:
CPU times: user 20.1 s, sys: 2.67 s, total: 22.7 s
Wall time: 29min 46s
# building model with best parameters
ada_tuned_model = AdaBoostClassifier(
n_estimators=1050, learning_rate=0.1, random_state=seed
)
# Fit the model on training data
ada_tuned_model.fit(X_train_un, y_train_un)
AdaBoostClassifier(learning_rate=0.1, n_estimators=1050, random_state=1)
ada_tuned_model_score = get_metrics_score(
ada_tuned_model, X_train, X_val, y_train, y_val
)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
ada_down_cv = cross_val_score(
estimator=ada_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
add_score_model(
"AdaBoost Tuned with Down Sampling", ada_tuned_model_score, ada_down_cv.mean()
)
make_confusion_matrix(ada_tuned_model, X_val, y_val)
%%time
# Defining the model
model = lgb.LGBMClassifier(random_state=seed)

# Hyperparameter values to sample from
min_gain_to_split = [0.01, 0.1, 0.2, 0.3]
min_data_in_leaf = [10, 20, 30, 40, 50]
feature_fraction = [0.8, 0.9, 1.0]
max_depth = [5, 8, 15, 25, 30]
extra_trees = [True, False]
learning_rate = [0.01, 0.1, 0.2, 0.05]

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "min_gain_to_split": min_gain_to_split,
    "min_data_in_leaf": min_data_in_leaf,
    "feature_fraction": feature_fraction,
    "max_depth": max_depth,
    "extra_trees": extra_trees,
    "learning_rate": learning_rate,
    "boosting_type": ["gbdt"],
    "objective": ["binary"],
    "is_unbalance": [True],
    "metric": ["binary_logloss"],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling RandomizedSearchCV
lgbm_tuned = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=10,
    random_state=seed,
    n_jobs=-1,
)

# Fitting RandomizedSearchCV on the down-sampled training data
lgbm_tuned.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:".format(lgbm_tuned.best_params_, lgbm_tuned.best_score_))
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Info] Number of positive: 976, number of negative: 976
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000145 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1433
[LightGBM] [Info] Number of data points in the train set: 1952, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[... the same "No further splits with positive gain" warning repeated for the remaining boosting iterations ...]
Best parameters are {'objective': 'binary', 'min_gain_to_split': 0.01, 'min_data_in_leaf': 50, 'metric': 'binary_logloss', 'max_depth': 8, 'learning_rate': 0.2, 'is_unbalance': True, 'feature_fraction': 0.8, 'extra_trees': False, 'boosting_type': 'gbdt'} with CV score=0.9559330948874394:
CPU times: user 2.51 s, sys: 210 ms, total: 2.72 s
Wall time: 52.4 s
# Building the model with the best parameters
lgbm_tuned_model = lgb.LGBMClassifier(
    min_gain_to_split=0.01,
    min_data_in_leaf=50,
    feature_fraction=0.8,
    max_depth=8,
    extra_trees=False,
    learning_rate=0.2,
    objective="binary",
    metric="binary_logloss",
    is_unbalance=True,
    boosting_type="gbdt",
    random_state=seed,
)
# Fit the model on training data
lgbm_tuned_model.fit(X_train_un, y_train_un)
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 976, number of negative: 976
[LightGBM] [Info] Total Bins 1433
[LightGBM] [Info] Number of data points in the train set: 1952, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[... the same "No further splits with positive gain" warning repeated for the remaining boosting iterations ...]
LGBMClassifier(extra_trees=False, feature_fraction=0.8, is_unbalance=True,
               learning_rate=0.2, max_depth=8, metric='binary_logloss',
               min_data_in_leaf=50, min_gain_to_split=0.01, objective='binary',
               random_state=1)
lgbm_tuned_model_score = get_metrics_score(
    lgbm_tuned_model, X_train, X_val, y_train, y_val
)
# 10-fold stratified CV on the down-sampled training set; recall is the
# scoring metric because a missed attriting customer (false negative) is
# the costly error for the bank
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
lgb_down_cv = cross_val_score(
    estimator=lgbm_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
# Record validation metrics and mean CV recall in the model-comparison table
add_score_model(
    "Light GBM Tuned with Down Sampling", lgbm_tuned_model_score, lgb_down_cv.mean()
)
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 878, number of negative: 878
[LightGBM] [Info] Number of data points in the train set: 1756, number of used features: 25
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(... identical parameter-alias and "No further splits" warnings repeat for each of the 10 CV folds; truncated ...)
best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further 
splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. 
Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Warning] Found whitespace in feature_names, replace with underlines [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Info] Number of positive: 879, number of negative: 878 [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000166 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`. 
[LightGBM] [Info] Total Bins 1432 [LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 25 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138 [LightGBM] [Info] Start training from score 0.001138 [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] 
[Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, 
best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further 
splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. 
Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Warning] Found whitespace in feature_names, replace with underlines [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Info] Number of positive: 879, number of negative: 878 [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000481 seconds. You can set `force_col_wise=true` to remove the overhead. 
[LightGBM] [Info] Total Bins 1431 [LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 25 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500285 -> initscore=0.001138 [LightGBM] [Info] Start training from score 0.001138 [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] 
[Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, 
best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further 
splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. 
Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Warning] Found whitespace in feature_names, replace with underlines [LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 [LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01 [LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8 [LightGBM] [Info] Number of positive: 878, number of negative: 879 [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000527 seconds. You can set `force_col_wise=true` to remove the overhead. 
[LightGBM] [Info] Total Bins 1433 [LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 25 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138 [LightGBM] [Info] Start training from score -0.001138 [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] 
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50
[LightGBM] [Warning] min_gain_to_split is set=0.01, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=0.01
[LightGBM] [Warning] feature_fraction is set=0.8, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.8
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 878, number of negative: 879
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000553 seconds. You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1431
[LightGBM] [Info] Number of data points in the train set: 1757, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499715 -> initscore=-0.001138
[LightGBM] [Info] Start training from score -0.001138
(identical [Warning]/[Info] blocks repeat for each cross-validation fold; output truncated)
make_confusion_matrix(lgbm_tuned_model, X_val, y_val)
[LightGBM] [Warning] min_data_in_leaf is set=50, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=50 (further alias warnings truncated)
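The repeated "X is set, Y will be ignored" messages appear because both a LightGBM-native parameter name and its scikit-learn-style alias were passed at once; each pair controls the same knob. A sketch of a parameter dict that would avoid the warnings (the values mirror those visible in the log; `verbosity` is an additional assumption used here to silence the per-split messages):

```python
# Each LightGBM knob has one canonical name plus aliases; passing only one
# name per knob avoids the "will be ignored" warnings seen above.
lgbm_params = {
    "min_data_in_leaf": 50,     # alias of min_child_samples - pass only one
    "min_gain_to_split": 0.01,  # alias of min_split_gain
    "feature_fraction": 0.8,    # alias of colsample_bytree
    "verbosity": -1,            # suppress the per-split warning spam
}
```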
%%time
# defining model
model = GradientBoostingClassifier(random_state=seed)
# Number of trees in the gradient boosting ensemble
n_estimators = [int(x) for x in np.linspace(start=50, stop=2000, num=10)]
max_features = ["auto", "sqrt"]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10, 15]
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": n_estimators,
    "max_features": max_features,
    "max_depth": max_depth,
    "min_samples_split": min_samples_split,
    "min_samples_leaf": min_samples_leaf,
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
gbm_tuned = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=10,
    random_state=seed,
    n_jobs=-1,
)
# Fitting RandomizedSearchCV
gbm_tuned.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}".format(gbm_tuned.best_params_, gbm_tuned.best_score_))
Best parameters are {'n_estimators': 266, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 8} with CV score=0.9559120555438672:
CPU times: user 15.6 s, sys: 2.48 s, total: 18.1 s
Wall time: 22min 6s
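The search object itself exposes the winning configuration, so the final model can be rebuilt from `best_params_` directly rather than re-typing values by hand. A minimal sketch on synthetic data (the names `search`, `X`, `y`, and the tiny parameter grid are illustrative, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Tiny synthetic stand-in for the training data
X, y = make_classification(n_samples=120, n_features=8, random_state=1)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions={"n_estimators": [10, 20], "max_depth": [2, 3]},
    n_iter=2,
    scoring="recall",
    cv=2,
    random_state=1,
)
search.fit(X, y)

# Rebuild the final model from the winning configuration instead of
# copying values manually (search.best_estimator_ is already refit, too)
final_model = GradientBoostingClassifier(**search.best_params_, random_state=1)
final_model.fit(X, y)
```

`search.best_estimator_` is an alternative that skips the explicit rebuild, since `RandomizedSearchCV` refits the best configuration on the full training data by default.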
# building the final model with hand-picked parameters
# (note: these differ from the best_params_ printed above)
gbm_tuned_model = GradientBoostingClassifier(
    n_estimators=700,
    max_features="sqrt",
    max_depth=25,
    min_samples_split=2,
    min_samples_leaf=15,
    random_state=seed,
)
# Fit the model on training data
gbm_tuned_model.fit(X_train_un, y_train_un)
GradientBoostingClassifier(max_depth=25, max_features='sqrt',
                           min_samples_leaf=15, n_estimators=700,
                           random_state=1)
gbm_tuned_model_score = get_metrics_score(
    gbm_tuned_model, X_train, X_val, y_train, y_val
)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
gbm_down_cv = cross_val_score(
estimator=gbm_tuned_model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
add_score_model(
"GBM Tuned with Down Sampling", gbm_tuned_model_score, gbm_down_cv.mean()
)
make_confusion_matrix(gbm_tuned_model, X_val, y_val)
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
for col in comparison_frame.select_dtypes(include="float64").columns.tolist():
comparison_frame[col] = round(comparison_frame[col] * 100, 0).astype(int)
comparison_frame.tail(4).sort_values(
by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
)
| | Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | XGB Tuned with Down Sampling | 100 | 68 | 69 | 100 | 100 | 33 | 34 | 50 | 51 | 97 | 97 |
| 23 | Light GBM Tuned with Down Sampling | 96 | 96 | 94 | 100 | 96 | 78 | 75 | 88 | 84 | 100 | 99 |
| 24 | GBM Tuned with Down Sampling | 95 | 96 | 94 | 100 | 96 | 79 | 75 | 88 | 84 | 100 | 99 |
| 22 | AdaBoost Tuned with Down Sampling | 94 | 94 | 94 | 96 | 96 | 73 | 74 | 83 | 83 | 99 | 99 |
- The XGBoost model with hyperparameter tuning, trained on the undersampled dataset, has the best validation recall (~99%), but its accuracy is lower than the naive baseline of classifying every customer as non-attriting. We are therefore not selecting it as the final model.
- The GBM with hyperparameter tuning, trained on the undersampled dataset, delivers validation recall of ~97%, validation accuracy of ~94%, precision of ~74%, validation AUC of ~99%, and a cross-validation mean of 96%. The model suffers from neither high bias nor high variance, so we are selecting "GBM Tuned with Down Sampling" as our final model.
feature_names = X_train.columns
importances = gbm_tuned_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Let's check the performance of the model on Test (unseen) dataset.
gbm_tuned_model_test_score = get_metrics_score(
gbm_tuned_model, X_train, X_test, y_train, y_test
)
final_model_names = ["gbm Tuned Down-sampled Trained"]
final_acc_train = [gbm_tuned_model_test_score[0]]
final_acc_test = [gbm_tuned_model_test_score[1]]
final_recall_train = [gbm_tuned_model_test_score[2]]
final_recall_test = [gbm_tuned_model_test_score[3]]
final_precision_train = [gbm_tuned_model_test_score[4]]
final_precision_test = [gbm_tuned_model_test_score[5]]
final_f1_train = [gbm_tuned_model_test_score[6]]
final_f1_test = [gbm_tuned_model_test_score[7]]
final_roc_auc_train = [gbm_tuned_model_test_score[8]]
final_roc_auc_test = [gbm_tuned_model_test_score[9]]
final_result_score = pd.DataFrame(
{
"Model": final_model_names,
"Train_Accuracy": final_acc_train,
"Test_Accuracy": final_acc_test,
"Train_Recall": final_recall_train,
"Test_Recall": final_recall_test,
"Train_Precision": final_precision_train,
"Test_Precision": final_precision_test,
"Train_F1": final_f1_train,
"Test_F1": final_f1_test,
"Train_ROC_AUC": final_roc_auc_train,
"Test_ROC_AUC": final_roc_auc_test,
}
)
for col in final_result_score.select_dtypes(include="float64").columns.tolist():
final_result_score[col] = final_result_score[col] * 100
final_result_score
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gbm Tuned Down-sampled Trained | 95.687 | 94.472 | 100.000 | 97.538 | 78.837 | 75.297 | 88.166 | 84.987 | 99.815 | 99.234 |
The performance of the model on the test data is very close to its performance on the validation dataset.
make_confusion_matrix(gbm_tuned_model, X_test, y_test)
The ROC curve (Receiver Operating Characteristic) and its Area Under the Curve (AUC) summarize how well the model separates the two classes.
If the model identifies the classes almost perfectly, the AUC is close to 1.
If the model cannot distinguish the classes any better than random guessing, the AUC is close to 0.5.
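As a quick illustration of these two extremes (toy labels and scores, not output from our model): a scorer that ranks every attriting customer above every non-attriting one reaches AUC = 1.0, while a scorer with no separation at all sits at 0.5.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.3, 0.8, 0.9]  # all positives ranked above all negatives
useless_scores = [0.5, 0.5, 0.5, 0.5, 0.5]  # no separation between the classes

print(roc_auc_score(y_true, perfect_scores))  # 1.0
print(roc_auc_score(y_true, useless_scores))  # 0.5
```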
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt
# Create and display the ROC curve
RocCurveDisplay.from_estimator(gbm_tuned_model, X_test, y_test)
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], "b--")
plt.xlim([-0.05, 1])
plt.ylim([0, 1.05])
plt.ylabel("True Positive Rate")
plt.xlabel("False Positive Rate")
plt.show()
Our model separates the two classes very well, since its AUC is close to 1.
Now that we have finalized our model, we'll build a model pipeline to streamline all the steps of model building. We'll start with the initial dataset and proceed through the pipeline-building steps.
A Machine Learning (ML) pipeline represents the sequence of steps, such as data transformation and prediction, through which data passes. The outcome of the pipeline is a trained model that can be used for making predictions. `sklearn.pipeline` is scikit-learn's implementation of this idea: instead of applying the data transformation and model fitting steps to the training and test datasets separately, we can chain them so that transformations fitted on the training data are automatically reused when predicting on test data.
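Before building the full pipeline for our churn data, here is a minimal sketch (toy one-feature data, hypothetical step names) of how a `Pipeline` chains a transformer and an estimator: `fit()` runs each step's `fit_transform` on the training data, and `predict()` reuses the already-fitted steps on new data.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

toy_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),    # transformer: fitted on training data only
    ("clf", LogisticRegression()),  # final estimator
])

X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
toy_pipe.fit(X_toy, y_toy)

# New data is scaled with the fitted scaler, then classified
print(toy_pipe.predict([[2.5]]))  # [1]
```

This is exactly the pattern we follow below, just with our custom cleansing transformers, a `ColumnTransformer` for encoding/scaling, and the tuned GBM as the final estimator.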
# The static variables
# Random state and loss
seed = 1
loss_func = "logloss"
# Test and Validation sizes
test_size = 0.2
val_size = 0.25
# Dependent variable value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}
df_pipe = churner.copy()
cat_columns = df_pipe.select_dtypes(include="object").columns.tolist()
df_pipe[cat_columns] = df_pipe[cat_columns].astype("category")
X = df_pipe.drop(columns=["Attrition_Flag"])
y = df_pipe["Attrition_Flag"].map(target_mapper)
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=test_size, random_state=seed, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=val_size, random_state=seed, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 20) (2026, 20) (2026, 20)
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
Attrition_Flag
0    0.839
1    0.161
Name: proportion, dtype: float64
Attrition_Flag
0    0.839
1    0.161
Name: proportion, dtype: float64
Attrition_Flag
0    0.840
1    0.160
Name: proportion, dtype: float64
under_sample = RandomUnderSampler(random_state=seed)
X_train_un, y_train_un = under_sample.fit_resample(X_train, y_train)
# For dropping columns
columns_to_drop = [
"clientnum",
"credit_limit",
"dependent_count",
"months_on_book",
"avg_open_to_buy",
"customer_age",
]
# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"
# One-hot encoding columns
columns_to_encode = [
"gender",
"education_level",
"marital_status",
"income_category",
"card_category",
]
# Numerical Columns
num_columns = [
"total_relationship_count",
"months_inactive_12_mon",
"contacts_count_12_mon",
"total_revolving_bal",
"total_amt_chng_q4_q1",
"total_trans_amt",
"total_trans_ct",
"total_ct_chng_q4_q1",
"avg_utilization_ratio",
]
# Columns for null imputation with Unknown
columns_to_null_imp_unknown = ["education_level", "marital_status"]
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()
# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)
# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)
# Missing value imputation
imputer = FillUnknown()
# To encode the categorical data
one_hot = OneHotEncoder(handle_unknown="ignore")
# To scale numerical columns
scaler = RobustScaler()
# creating a transformer for feature name standardization and dropping columns
cleanser = Pipeline(
steps=[
("feature_name_standardizer", feature_name_standardizer),
("column_dropper", column_dropper),
("value_mask", value_masker),
("imputation", imputer),
]
)
# creating a transformer for data encoding
encode_transformer = Pipeline(steps=[("onehot", one_hot)])
num_scaler = Pipeline(steps=[("scale", scaler)])
preprocessor = ColumnTransformer(
transformers=[
("encoding", encode_transformer, columns_to_encode),
("scaling", num_scaler, num_columns),
],
remainder="passthrough",
)
# Model
gbm_tuned_model = GradientBoostingClassifier(
n_estimators=700,
max_features="sqrt",
max_depth=25,
min_samples_split=2,
min_samples_leaf=15,
random_state=seed,
)
# Creating new pipeline with best parameters
model_pipe = Pipeline(
steps=[
("cleanse", cleanser),
("preprocess", preprocessor),
("model", gbm_tuned_model),
]
)
# Fit the model on training data
model_pipe.fit(X_train_un, y_train_un)
Pipeline(steps=[('cleanse',
                 Pipeline(steps=[('feature_name_standardizer',
                                  <__main__.FeatureNamesStandardizer object at 0x78fdb2bc5b70>),
                                 ('column_dropper',
                                  <__main__.ColumnDropper object at 0x78fdb2bc4c40>),
                                 ('value_mask',
                                  <__main__.CustomValueMasker object at 0x78fdb2bc5b40>),
                                 ('imputation',
                                  <__main__.FillUnknown object at 0x78fdb2bc74f0>)])),
                ('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('encoding',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['gender', 'education_level',
                                                   'marital_status',
                                                   'income_category',
                                                   'card_category']),
                                                 ('scaling',
                                                  Pipeline(steps=[('scale',
                                                                   RobustScaler())]),
                                                  ['total_relationship_count',
                                                   'months_inactive_12_mon',
                                                   'contacts_count_12_mon',
                                                   'total_revolving_bal',
                                                   'total_amt_chng_q4_q1',
                                                   'total_trans_amt',
                                                   'total_trans_ct',
                                                   'total_ct_chng_q4_q1',
                                                   'avg_utilization_ratio'])])),
                ('model',
                 GradientBoostingClassifier(max_depth=25, max_features='sqrt',
                                            min_samples_leaf=15,
                                            n_estimators=700,
                                            random_state=1))])
print(
    "Accuracy on Test is: {}%".format(round(model_pipe.score(X_test, y_test) * 100, 0))
)
Accuracy on Test is: 94.0%
# predicted class = 1 when the probability of attrition exceeds the 0.5 cut-off
pred_train_p = (model_pipe.predict_proba(X_train_un)[:, 1] > 0.5).astype(int)
pred_test_p = (model_pipe.predict_proba(X_test)[:, 1] > 0.5).astype(int)
train_acc_p = accuracy_score(pred_train_p, y_train_un)
test_acc_p = accuracy_score(pred_test_p, y_test)
train_recall_p = recall_score(y_train_un, pred_train_p)
test_recall_p = recall_score(y_test, pred_test_p)
print("Recall on Test is: {}%".format(round(test_recall_p * 100, 0)))
Recall on Test is: 98.0%
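The 0.5 cut-off used above is not fixed: lowering it flags more customers as attriting, which raises recall at the cost of precision. A small sketch with hypothetical probabilities (not output from our model) makes the trade-off concrete.

```python
import numpy as np
from sklearn.metrics import recall_score, precision_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.4, 0.6, 0.3, 0.55, 0.8, 0.9])  # hypothetical scores

for threshold in (0.5, 0.25):
    pred = (proba > threshold).astype(int)
    print(threshold,
          recall_score(y_true, pred),
          round(precision_score(y_true, pred), 2))
# 0.5  -> recall 0.75, precision 0.75
# 0.25 -> recall 1.0,  precision 0.67
```

If the bank values catching attriting customers above all else, a lower cut-off could be chosen on the validation set before scoring the test set.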
# mask the upper triangle so each pairwise correlation is shown only once
mask = np.zeros_like(data_clean.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
data_clean.corr(),
cmap=sns.diverging_palette(20, 220, n=200),
annot=True,
mask=mask,
center=0,
)
plt.show()